ABSTRACT

ChemMine Tools is an online service for small mol- 
ecule data analysis. It provides a web interface to a 
set of cheminformatics and data mining tools that 
are useful for various analysis routines performed in 
chemical genomics and drug discovery. The service 
also offers programmable access options via the 
R library ChemmineR. The primary functionalities 
of ChemMine Tools fall into five major application 
areas: data visualization, structure comparisons, 
similarity searching, compound clustering and 
prediction of chemical properties. First, users 
can upload compound data sets to the online 
Compound Workbench. Numerous utilities are 
provided for compound viewing, structure drawing 
and format interconversion. Second, pairwise 
structural similarities among compounds can be 
quantified. Third, interfaces to ultra-fast structure 
similarity search algorithms are available to effi- 
ciently mine the chemical space in the public 
domain. These include fingerprint and embedding/ 
indexing algorithms. Fourth, the service includes a 
Clustering Toolbox that integrates cheminformatic 
algorithms with data mining utilities to enable sys- 
tematic structure and activity based analyses of 
custom compound sets. Fifth, physicochemical 
property descriptors of custom compound sets 
can be calculated. These descriptors are important 
for assessing the bioactivity profile of compounds 
in silico and for quantitative structure-activity relationship
(QSAR) analyses. ChemMine Tools is available
at: http://chemmine.ucr.edu. 

INTRODUCTION 

Cheminformatics tools for analyzing small molecule 
screening data play an important role in many fields
including chemical biology, chemical genomics, drug 
discovery and agrochemical research (1-3). Informatics 
resources in these areas are essential for exploring the 
structure, properties and bioactivity of biologically 
relevant molecules. To provide these capabilities, 
software tools are required for analyzing the structural 
similarities, physicochemical properties and bioactivity 
profiles of natural and synthetic compounds to gain 
insight into their modes of action in biological systems. 
This information is important for the development of 
effective small molecule probes for studying the functions 
of protein and cellular networks in chemical genomics 
and drug discovery research (4). In addition, similar 
informatics resources are required for identifying the 
structural and physicochemical relationships among 
compounds from metabolic or signaling pathways (5-7). 
The rapidly growing relevance of chemical genomics 
approaches for modern biology research has significantly 
increased demand for small molecule mining systems in 
academia (8). 

Currently, the structures of over 30 million distinct 
small molecules are available in open-access databases, 
including PubChem, ChemBank and many others (9-15). 
In addition, preliminary bioactivity data from hundreds of 
high-throughput screening (HTS) experiments against a 
wide spectrum of target sites have become available for 
almost one million compounds in the bioassay sections 
of various public databases (see below; 9,10,15,16). 
To efficiently analyze these resources, the development 
of novel compound data mining and cheminformatic 
web services is essential. 

While there has been extensive development of public 
domain small molecule databases in recent years (6,9-11, 
13-24), the number of open access web services for 
analyzing public or custom small molecule data is ex- 
tremely limited at this point (25,26). Thus far, most devel- 
opment has been focused on standalone software 
applications targeted toward computational rather than 
experimental scientists. These include Open Babel 
(27,28), the Chemistry Development Kit (29,30), the 
Chemical Descriptors Library (31) and JOELib (32). 

Figure 1. Illustration of the functionalities provided by ChemMine Tools. The utilities of the five application domains (i-v) are listed in more detail
in Table 1.

Examples of software designed for non-expert users in 
this field are Chembench (33) for online quantitative
structure-activity relationship (QSAR) modeling and
KNIME (34) for designing data analysis pipelines.

Here, we present ChemMine Tools as an online portal to 
a variety of cheminformatics, visualization, search and 
clustering tools for small molecule data. The utilities 
provided by this service are useful for various analysis
and data mining routines of small molecule screening
experiments in chemical genomics and related areas. An
easy-to-use web interface makes these tools accessible to
experimental scientists without an extensive computational
background.

METHODS 

Conceptually, the ChemMine Tools online service is 
divided into five application domains (Figure 1 and 
Table 1): (i) a Compound workbench for data imports 
and result management; (ii) a Structure Similarity 
toolbox to quantify the similarities among compounds; 
(iii) a Search toolbox for retrieving similar compounds
from PubChem; (iv) a Clustering toolbox for accessing 
clustering and data visualization tools; and (v) a 
Property toolbox for predicting physicochemical 
properties of compounds. To construct robust data 
analysis workflows, the back-end of the server employs a 
modular design architecture with object-oriented methods 
and container classes assuring compatible input/output 
flows and parameter settings among the different data 
processing units. Currently, the server integrates over 
30 cheminformatics and data mining tools that were 
developed by this or related open source projects. The 
modular organization of the ChemMine Tools service
has several advantages. For instance, it maximizes the 
transparency and maintainability of the system, and 
simplifies the addition of new features and analysis 
methods upon user request. The web interface of 
ChemMine Tools is written in Python using the 
object-oriented and highly scalable Django web framework.
Modern JavaScript/Ajax utilities are embedded to
generate interactive and customizable high-content web 
pages. Moreover, the ChemMine Tools project is dedicated 
to an open access and resource sharing policy. All of its 
online services and downloadable software components 
are freely available without restrictions. The following 
subsections give a detailed description of the underlying 
algorithms and software tools used by the individual 
ChemMine Tools services. 

DISCUSSION OF SERVICES 

Compound workbench 

A central feature of ChemMine Tools is its Compound 
workbench. It provides a flexible online workspace to 
upload, manage and visualize small molecule data. 
Compounds can be imported by reading them from 
local files, copy and paste, PubChem queries (see Search 
toolbox) or by interacting with the service through the 
ChemmineR library (35) within the statistical 
programming environment R. The latter is an extension 
of the ChemMine Tools project to provide a program- 
mable interface to more advanced users. Alternatively, 
compounds can be drawn online with the JME 
Molecular Editor (36) and then added to the Compound 
workbench. Currently, the import utility supports the
structure data format (SDF) and simplified molecular
input line entry system (SMILES). After the import, one
can organize and annotate the compounds or view their
structure images in single or batch modes. These images
are generated in real time from the underlying structure
definition data using the structure depiction tool of the
CACTVS software suite (11) which runs on the server
side. To revisit instances of compound sets, users can
save their workbench for later use by downloading the
compounds to local files. The compound download
function also serves as a format conversion tool to
interconvert structure representations between SDF and
SMILES formats using utilities from the Open Babel
project (27,28). Once the user has populated the
Compound workbench with structures, it serves as a
central submission system to all downstream analysis
services.

Table 1. List of services provided by ChemMine Tools
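
As an illustration of the SDF import described in this section, the following Python sketch splits SDF text into individual compound records. It relies only on general properties of the SDF format: each record ends with a line containing `$$$$`, the first line of a record holds the compound name, the connection table ends with `M  END`, and optional data fields are introduced by `> <FIELD>` header lines. The function names `split_sdf` and `_parse_record` and the sample record are assumptions made for this example; this is not the import code used by ChemMine Tools.

```python
def _parse_record(lines):
    """Turn the lines of one SDF record into a small dictionary."""
    title = lines[0].strip() if lines else ""
    molblock, fields = [], {}
    in_molblock, field = True, None
    for line in lines:
        if in_molblock:
            molblock.append(line)
            if line.startswith("M  END"):      # end of the connection table
                in_molblock = False
        elif line.startswith(">"):             # data header such as '> <MW>'
            start, end = line.find("<"), line.rfind(">")
            field = line[start + 1:end] if 0 <= start < end else None
            if field is not None:
                fields[field] = []
        elif line.strip() == "":               # a blank line closes a data item
            field = None
        elif field is not None:
            fields[field].append(line)
    return {
        "title": title,
        "molblock": "\n".join(molblock),
        "fields": {k: "\n".join(v) for k, v in fields.items()},
    }


def split_sdf(text):
    """Split SDF text into records, one dictionary per compound."""
    records, lines = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":             # record terminator
            if lines:
                records.append(_parse_record(lines))
            lines = []
        else:
            lines.append(line)
    if any(ln.strip() for ln in lines):        # tolerate a missing final $$$$
        records.append(_parse_record(lines))
    return records


# Hypothetical single-atom record used only to exercise the parser.
SAMPLE = """methane
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <SOURCE>
example

$$$$
"""

records = split_sdf(SAMPLE)
```

Each returned dictionary keeps the raw connection table untouched, so a record can be passed on to a depiction or conversion tool without being re-serialized.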

Similarity toolbox 

In many small molecule screening data analysis routines it 
is important to compute objective similarity measures 
among compounds as a means to compare and prioritize 
structurally related lead compounds. To provide this func- 
tionality, ChemMine Tools has implemented two algo- 
rithms for computing similarity coefficients among 
compound structures. The first employs atom pairs as 
structural descriptors (37) and the widely used Tanimoto 
coefficient as a similarity measure (see below for more 
details). Alternatively, users can choose other similarity 
coefficients, such as Tversky or Dice (38). The second 
algorithm identifies the maximum common substructure 
(MCS) shared among compound pairs (39). Subsequently, 
the sizes of both compounds and the size of their shared
MCS are used to calculate the available similarity
coefficients. The underlying MCS algorithm often provides
the most accurate and sensitive similarity measure,
especially for compounds with large size differences (40,41).
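
The Tanimoto, Dice and Tversky coefficients can all be written in terms of three counts: the number of features in each compound (a and b) and the number of features they share (c). The Python sketch below is illustrative only and is not the ChemMine Tools implementation. The function names and the toy atom pair labels are assumptions, and the atom pair descriptors are treated as plain sets, which ignores repeated descriptors. For the MCS-based measure the same formulas apply, with a and b taken as the compound sizes and c as the size of the MCS.

```python
def tanimoto(a, b, c):
    """Tanimoto coefficient: c / (a + b - c)."""
    denom = a + b - c
    return c / denom if denom else 0.0   # convention: 0.0 for two empty sets


def dice(a, b, c):
    """Dice coefficient: 2c / (a + b)."""
    denom = a + b
    return 2 * c / denom if denom else 0.0


def tversky(a, b, c, alpha=1.0, beta=1.0):
    """Tversky index: c / (alpha*(a - c) + beta*(b - c) + c)."""
    denom = alpha * (a - c) + beta * (b - c) + c
    return c / denom if denom else 0.0


def set_similarity(x, y, coefficient=tanimoto):
    """Apply a coefficient to two collections of descriptors."""
    x, y = set(x), set(y)
    return coefficient(len(x), len(y), len(x & y))


# Hypothetical descriptor labels (atom type, atom type, path length).
ap1 = {"C_C_1", "C_O_1", "C_O_2"}
ap2 = {"C_C_1", "C_O_1", "C_N_1", "C_N_2"}

sim_tanimoto = set_similarity(ap1, ap2)        # 2 shared of 5 distinct
sim_dice = set_similarity(ap1, ap2, dice)      # 2*2 / (3 + 4)

# MCS-based variant: a 10-atom compound fully contained in a 30-atom one.
sim_mcs = tanimoto(10, 30, 10)
```

With alpha = beta = 1 the Tversky index reduces to the Tanimoto coefficient, and with alpha = beta = 0.5 it reduces to the Dice coefficient, which is why a single parameterized function can cover all three choices.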

Search toolbox 

To efficiently mine much of the chemical structure and 
bioactivity space available in the public domain, the 
ChemMine Tools service provides text and structure simi- 
larity search methods that interface with the PubChem 
database (15) via its SOAP-based Power User Gateway 
(PUG) data exchange feature. During an analysis 
session, instantaneous search functionality is often im- 
portant for retrieval of detailed property and annotation 
information for compounds of interest, or to identify 
related structures. In ChemMine Tools, structural similarity
searches can be performed with PubChem's fingerprint
search engine or via the EI Search method. The latter was
developed in house as part of this project to provide 
ultra-fast structure similarity search functionality using 
an embedding/indexing (EI) algorithm (42). When the fin- 
gerprint method is chosen, the query is sent to PubChem, 
where the structure search is performed and the results are 
returned to the Compound workbench. In contrast,
EI Search is specific to the ChemMine Tools project and
thus runs locally on its servers. These two tools possess
complementary strengths and weaknesses in identifying 
weak similarities among compounds (42). 

Clustering toolbox 

Clustering of compounds by structural or property simi- 
larity can be a powerful approach to correlating 
compound features with biological activity. Clustering 
tools are also widely utilized for diversity analyses to 
identify structural redundancies and other biases in 
compound libraries. ChemMine Tools' clustering 
workbench provides an online interface to three clustering
algorithms which include hierarchical clustering, multidi- 
mensional scaling (MDS) and binning clustering (35). The 
following provides a short overview of these tools, while a 
more detailed outline of the underlying theory and clus- 
tering schemes is available in the online tutorial. When 
clustering by structural similarity, the required similarity 
measures are computed by first generating the atom pair 
descriptors (features) for each compound which are then 
used to calculate a similarity matrix based on the common 
and unique features observed among all compound pairs 
using the Tanimoto coefficient. The Tanimoto coefficient 
has a range from 0 to 1 with higher values indicating 
greater similarity than lower ones. For the subsequent 
clustering steps, the similarity matrix is converted into a 
distance matrix by subtracting the similarity values 
from 1. The hierarchical and MDS clustering methods 
provided by ChemMine Tools are based on the R 
programs hclust and cmdscale, respectively; the third 
method utilizes an internally developed C++ implementa- 
tion. These three programs complement one another with 
respect to their data outputs and visualization options. 
Hierarchical clustering organizes compounds by similarity 
in a tree with branch lengths proportional to the 
item-to-item (compound-to-compound) similarities, while 
the MDS output encodes this information in a scatter plot. 
These two methods do not directly provide assignments of 
compounds to discrete similarity groups; assignments are 
generated downstream of the actual clustering process 
using various post-processing methods, such as tree 
cutting approaches. The binning clustering output 
provides these groupings directly for a user-definable simi- 
larity cutoff. For instance, if a Tanimoto coefficient of 0.6 
is chosen then compounds will be joined into groups that 
share a similarity of this value or greater using a 'single 
linkage' rule for cluster joining. Final results are presented 
as interactive visualization pages to simplify the interpret- 
ation of the (often complex) clustering results. The hier- 
archical clustering result page uses the Google Maps API 
to generate zoom- and click-able trees aligned with mo- 
lecular structure images. Moreover, heat maps of user 
uploaded data containing compound property, activity 
or other information can be viewed alongside the tree. 
A similar system is used to present the MDS results 
as click-able scatter plots with cursor-over viewing of 
compound structures. The binning clustering results are 
presented in a table view containing (among other infor- 
mation) the cluster identifiers and the corresponding 
compound depictions. 
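
Two of the steps described above lend themselves to a short sketch: the conversion of a similarity matrix into a distance matrix (1 minus the similarity), and the grouping of compounds at a similarity cutoff under a single linkage rule. The Python code below is a minimal illustration under stated assumptions. It is not the internally developed C++ implementation; the function names, the union-find approach and the toy similarity matrix are choices made for this example.

```python
def to_distance(sim):
    """Convert a square similarity matrix (values in [0, 1]) to distances."""
    return [[1.0 - s for s in row] for row in sim]


def binning_clusters(sim, cutoff):
    """Single linkage grouping: i and j are joined if sim[i][j] >= cutoff.

    Returns a sorted list of clusters, each a list of compound indices.
    """
    n = len(sim)
    parent = list(range(n))               # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= cutoff:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical Tanimoto similarities for four compounds.
sim = [
    [1.0, 0.7, 0.2, 0.1],
    [0.7, 1.0, 0.65, 0.1],
    [0.2, 0.65, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]

dist = to_distance(sim)
groups = binning_clusters(sim, 0.6)
```

In this toy matrix, compounds 0 and 2 have a direct similarity of only 0.2, yet at a cutoff of 0.6 they fall into the same group because both are linked to compound 1. This chaining behavior is characteristic of single linkage and is worth keeping in mind when choosing a cutoff.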

Property toolbox 

Predictions of small molecule physicochemical properties 
are important for assessing their 'druglikeness' and
'leadlikeness' in silico (43,44). They are also useful for
enriching compound collections with desirable properties.
For instance, the famous 'Lipinski Rule of Five' (45) is 
often applied to enrich compound collections with 
druglike candidates. This rule filters for compounds with
no more than 5 hydrogen bond donors, no more than 10
hydrogen bond acceptors, a molecular weight <500 daltons
and an octanol-water partition coefficient
log P < 5. Physicochemical property
data are essential for predicting bioactive and other 
properties of small molecules using modern machine 
learning approaches. These data are fundamental to the 
development of QSAR models (25). ChemMine Tools 
provides an online interface to the property prediction 
module of the JOELib package (32). This service can
calculate 38 physicochemical property values, including
Lipinski descriptors, for custom compound sets. The
resulting property tables can be downloaded or further
processed on ChemMine Tools by sending them to the
Clustering toolbox. There, they can be used to cluster 
compounds by similar property profiles, as described 
above, or the data can be visualized as a heat map next 
to the hierarchical clustering trees. 
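
Applying the Rule of Five to a downloaded property table amounts to a simple threshold filter. The Python sketch below assumes that each compound is represented as a dictionary with the hypothetical keys `hbd`, `hba`, `mw` and `logp`; these names, the function names and the example values are illustrative assumptions and do not correspond to JOELib output columns. Donor and acceptor limits are taken as at most 5 and at most 10, while the molecular weight and log P limits follow the strict bounds given in the text. Formulations of the rule differ in how they treat values exactly at a limit, so the comparisons may need adjusting.

```python
def rule_of_five_violations(hbd, hba, mw, logp):
    """Return the names of the Rule of Five criteria that are violated."""
    violations = []
    if hbd > 5:            # more than 5 hydrogen bond donors
        violations.append("hbd")
    if hba > 10:           # more than 10 hydrogen bond acceptors
        violations.append("hba")
    if mw >= 500:          # molecular weight not below 500 daltons
        violations.append("mw")
    if logp >= 5:          # octanol-water log P not below 5
        violations.append("logp")
    return violations


def filter_druglike(compounds, max_violations=0):
    """Keep compounds with at most `max_violations` violated criteria."""
    kept = []
    for cmp in compounds:
        found = rule_of_five_violations(
            cmp["hbd"], cmp["hba"], cmp["mw"], cmp["logp"]
        )
        if len(found) <= max_violations:
            kept.append(cmp)
    return kept


# Illustrative values only, not computed by any property prediction tool.
compounds = [
    {"id": "cmp1", "hbd": 1, "hba": 4, "mw": 180.2, "logp": 1.2},
    {"id": "cmp2", "hbd": 6, "hba": 12, "mw": 720.0, "logp": 6.1},
    {"id": "cmp3", "hbd": 2, "hba": 11, "mw": 410.5, "logp": 3.3},
]

druglike = filter_druglike(compounds)
relaxed = filter_druglike(compounds, max_violations=1)
```

The `max_violations` parameter is included because the rule is often applied leniently, allowing a single violated criterion.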

CONCLUSION AND FUTURE DEVELOPMENT 

ChemMine Tools is an online service for compound 
analysis in the chemical genomics field. The service is 
unique in that it integrates a large number of cheminfor- 
matic programs with clustering and visualization 
functionalities. Additional outstanding features of 
ChemMine Tools include: (i) its commitment to publicly 
developed open source software throughout its infrastruc- 
ture; (ii) its strong dedication to the development of new 
cheminformatic tools and their free distribution in the 
community; and (iii) the integration of its many compo- 
nents into a unified online and downloadable software 
infrastructure which maximizes their utility for diverse 
tasks with different levels of complexity and customization 
needs. An intuitive web interface makes these tools access- 
ible to scientists with limited computational background, 
while simultaneously providing a programmable interface 
for advanced users. To the best of our knowledge, there 
are currently no related online services available that 
provide a comparable suite of functionalities. Overlaps
exist; however, they are limited to isolated functionalities.
For instance, ChemDB and VCCLab (13,43) can be used 
for property predictions and structure format interconver- 
sions of single compound queries; and PubChem supports 
structure-based clustering for compounds retrieved from 
its own database. 

In the future, additional utilities will be added to
the ChemMine Tools service. These include MCS-based
search functionality within the Similarity toolbox to
support more complex graph-based search strategies
against custom compound sets imported into
the Compound workbench. Existing functionalities for
analyzing bioactivity data will also be expanded by 
adding a Bioactivity toolbox that will contain regression, 
machine learning and QSAR modeling tools. 